
[SPARK-25061][SQL] Precedence for ThriftServer hiveconf commandline parameter #27041

Closed (wants to merge 5 commits)

Conversation

ajithme (Contributor) commented Dec 29, 2019

What changes were proposed in this pull request?

In HiveClientImpl.scala, when we create a new SessionState, we prepare the configuration; the options passed to start-thriftserver.sh via --hiveconf need to take precedence when creating HiveConf, i.e. in the order:
hadoopConf < sparkConf < overrideProps (--hiveconf) < extraConfig
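This later-wins layering can be sketched with plain Scala maps (a simplification with made-up values, not the actual HiveClientImpl code; `++` on immutable maps gives later entries precedence, which is how such a merge achieves the intended order):

```scala
// Sketch only: illustrates the intended precedence with plain maps.
// All keys and values here are hypothetical examples.
object PrecedenceSketch {
  val hadoopConf    = Map("hive.metastore.uris" -> "thrift://from-hive-site:9083")
  val sparkConf     = Map("spark.sql.warehouse.dir" -> "/tmp/warehouse")
  val overrideProps = Map("hive.metastore.uris" -> "thrift://from-cli:9083") // from --hiveconf
  val extraConfig   = Map.empty[String, String]

  // Later operands of ++ win on duplicate keys, so listing sources from
  // lowest to highest precedence yields
  // hadoopConf < sparkConf < overrideProps < extraConfig.
  val confMap: Map[String, String] =
    hadoopConf ++ sparkConf ++ overrideProps ++ extraConfig
}
```

With this ordering, `confMap("hive.metastore.uris")` resolves to the --hiveconf value rather than the hive-site.xml one.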

Why are the changes needed?

As per the documentation at https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html, the user can provide --hiveconf to override Hive configurations when using start-thriftserver.sh. But as per the code at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L182, the hive-site properties (part of hadoopConf) override the configuration passed on the command line, which is not the expected behavior.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a UT; the base UTs also pass.

ajithme (Contributor, Author) commented Dec 29, 2019

ajithme changed the title from "[SPARK-25061] Spark SQL Thrift Server fails to not pick up hiveconf passing parameter" to "[SPARK-25061] Precedence for ThriftServer hiveconf commandline parameter" (Dec 29, 2019)
srowen (Member) commented Dec 29, 2019

I think @yhuai wrote the comment above about why it's processed in that order, so he may be able to review this better.

ajithme (Contributor, Author) commented Jan 8, 2020

gentle ping @yhuai @dongjoon-hyun @HyukjinKwon

dongjoon-hyun changed the title from "[SPARK-25061] Precedence for ThriftServer hiveconf commandline parameter" to "[SPARK-25061][SQL] Precedence for ThriftServer hiveconf commandline parameter" (Jan 12, 2020)
dongjoon-hyun (Member)

ok to test

dongjoon-hyun (Member) left a review comment

Hi, @ajithme.
In general, we need one of the following:

  1. A unit test case.
  2. An explicit, reproducible procedure.

In the "Tested this patch manually" section of the PR description, could you elaborate on your procedure to verify this PR? Please write what you did, with the command lines, and show the results before and after this PR.

SparkQA commented Jan 12, 2020

Test build #116541 has finished for PR 27041 at commit 6969ec2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member)

Could you check the UT failure, @ajithme ?

ajithme (Contributor, Author) commented Jan 13, 2020

> Could you check the UT failure, @ajithme?

Sure, will update shortly

ajithme (Contributor, Author) commented Jan 13, 2020

Found the cause of the test case failures: the org.apache.spark.sql.hive.client.HiveClientImpl#extraConfig entries sent by org.apache.spark.sql.hive.HiveUtils#newTemporaryConfiguration were lost because overriddenHiveProps was applied on top of them. Hence fixing the order, i.e.:
hadoopConf < sparkConf < overrideProps < extraConfig

Please retest. I will update the PR description with the manual verification steps shortly.

SparkQA commented Jan 13, 2020

Test build #116656 has finished for PR 27041 at commit cd704bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ajithme (Contributor, Author) commented Jan 14, 2020

@dongjoon-hyun I have updated the PR with the test case failure correction, and also added a UT to reproduce and verify the issue. Please review.

SparkQA commented Jan 14, 2020

Test build #116713 has finished for PR 27041 at commit 5043b0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  val confMap = (hadoopConf.iterator().asScala.map(kv => kv.getKey -> kv.getValue) ++
-   sparkConf.getAll.toMap ++ extraConfig).toMap
+   sparkConf.getAll.toMap ++ overriddenHiveProps ++ extraConfig).toMap
yhuai (Contributor) left a review comment

Seems we should update https://github.com/apache/spark/pull/27041/files#diff-6fd847124f8eae45ba2de1cf7d6296feR170-R179 and also explain why extraConfig is at the end.

ajithme (Contributor, Author) replied

@yhuai Sure, I have updated the PR with reasonable pointers for the order. Does it suffice now?

yhuai (Contributor) replied

Thank you. As getConfSystemProperties will get all of the Hive confs that are in the system properties, it is possible that we will pull in a config that is not set by --hiveconf. It seems we are introducing a behavior change? Can you explain the impact of this change and why it is fine?

ajithme (Contributor, Author) replied

Sure. Even without my changes in this PR, HiveConf always considers the Hive confs in system properties that were not set via --hiveconf, as part of the HiveConf constructor; refer to https://github.com/apache/hive/blob/rel/release-2.3.5/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4079 (same behaviour for Hive 1.2.1 as well). Hence, this does not change any flow.
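The point about system-property harvesting can be sketched roughly as follows (the `hive.` prefix filter is a simplification of how getConfSystemProperties matches known Hive conf names; the property name is a made-up example):

```scala
import scala.jdk.CollectionConverters._

// Simplified stand-in for HiveConf.getConfSystemProperties: harvest every
// JVM system property that looks like a Hive conf. Note that this picks up
// properties set by any code path, not only those set via --hiveconf.
object SystemPropsSketch {
  // Simulate a hive conf landing in system properties without --hiveconf.
  System.setProperty("hive.exec.dynamic.partition", "true")

  val hiveSystemProps: Map[String, String] =
    System.getProperties.asScala.collect {
      case (k, v) if k.startsWith("hive.") => k -> v
    }.toMap
}
```

Here `hiveSystemProps` contains the property even though it never came from the command line, which is the behavior-change concern raised above.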

SparkQA commented Jan 14, 2020

Test build #116725 has finished for PR 27041 at commit 2ee9f61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ajithme (Contributor, Author) commented Feb 18, 2020

Gentle ping @dongjoon-hyun @yhuai @cloud-fan @HyukjinKwon: can we get this fix into 3.0?


// do not lose command-line overridden properties;
// make a copy of the overridden props so that they can be reinserted at the end
val overriddenHiveProps = HiveConf.getConfSystemProperties.asScala
cloud-fan (Contributor) left a review comment

So did we totally ignore --hiveconf previously?

ajithme (Contributor, Author) replied

It was handled: --hiveconf is applied as part of the HiveConf constructor; refer to https://github.com/apache/hive/blob/rel/release-2.3.5/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4079. That is, it first loads hive-site.xml and then adds the --hiveconf properties on top.

But in Spark we again add hadoopConf on top of that, hence overwriting HiveConf's order.
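The override loss being described can be demonstrated with plain maps standing in for the real config objects (the host names are made-up placeholders):

```scala
// Illustration only: fromHiveSite / fromHiveconf are hypothetical stand-ins
// for values loaded from hive-site.xml and passed via --hiveconf.
object OverrideLostSketch {
  val fromHiveSite = Map("hive.metastore.uris" -> "thrift://site-host:9083")
  val fromHiveconf = Map("hive.metastore.uris" -> "thrift://cli-host:9083")

  // HiveConf's own order: hive-site.xml first, --hiveconf layered on top,
  // so the command-line value wins here.
  val hiveConfView = fromHiveSite ++ fromHiveconf

  // The pre-fix Spark behavior (simplified): hadoopConf, which already
  // contains the hive-site values, was layered back on top afterwards,
  // silently undoing the command-line override.
  val afterSparkRelayer = hiveConfView ++ fromHiveSite
}
```

The two resulting values differ, which is exactly the reported bug: the user sets a conf on the command line but ends up running with the hive-site.xml value.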

cloud-fan (Contributor) replied

Can you add more code comments to convince people that HiveConf.getConfSystemProperties contains only the --hiveconf properties?

ajithme (Contributor, Author) replied

@cloud-fan Updated. Is the comment adequate now?

SparkQA commented Feb 18, 2020

Test build #118644 has finished for PR 27041 at commit 3718df9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor)

@wangyum @yaooqinn can you take a look?

yaooqinn (Member)

Thanks for pinging me @cloud-fan

We can override Hive configurations in many ways.

Take hive.metastore.uris as an example: we can reset it via --conf spark.hadoop.hive.metastore.uris=thrift://example.com:9083, or --conf spark.hive.hive.metastore.uris=thrift://example.com:9083, or --hiveconf hive.metastore.uris=thrift://example.com:9083, or maybe (not sure) --conf spark.driver.extraJavaOptions=-Dxxx.

This PR seems to prefer --hiveconf over the others.

Personally, I prefer that Spark configurations always have higher precedence than other types of configuration, including Hive/Hadoop/Java/system properties, etc., as we are writing Spark applications.

ajithme (Contributor, Author) commented Feb 19, 2020


Thanks @yaooqinn for your thoughts. It is a little confusing to know which source overrides which, as per the documentation at https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server

I agree with your opinion of giving sparkConf the highest precedence, but the command line (--hiveconf) should be preferred over the config file (hive-site.xml).

For the cases you mentioned (X marks the conf being set):

| type | case 1 | case 2 | case 3 | case 4 |
| --- | --- | --- | --- | --- |
| --conf spark.hadoop.hive.* | X | X | - | - |
| --conf spark.hive.hive.* | X | - | X | - |
| --hiveconf | X | X | X | X |
| hive-site.xml | X | X | X | X |
| Preference | ? | --conf spark.hadoop.hive.* | --conf spark.hive.* | --hiveconf |

So do you mean that in cases 1, 2, and 3, where a spark.* conf is used, it must get preference? How about case 4?

I prefer that in case 4 --hiveconf has precedence, and that in the rest of the cases the Spark conf can have higher precedence.

yaooqinn (Member)

I have no strong opinion about case 4.

BTW, you should also pay attention to SparkSQLCLIDriver, where the Hive SessionState is initialized before the Spark context.

github-actions (bot)
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 30, 2020
@github-actions github-actions bot closed this May 31, 2020